Self-supervised Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Supervised Learning and Transfer Learning

Supervised pretraining on large labeled datasets has led to successful transfer learning

  • ImageNet

  • Pretrain for fine-grained image classification of 1000 classes



  • Use feature representations for downstream tasks, e.g., object detection, image segmentation, and action recognition



But supervised pretraining comes at a cost …

  • Time-consuming and expensive to label datasets for new tasks
  • Domain expertise needed for specialized tasks
    • Radiologists to label medical images
    • Native speakers or language specialists for labeling text in different languages
  • To relieve the labeling burden, alternatives include:
    • Semi-supervised learning
    • Weakly-supervised learning
    • Unsupervised learning

Self-supervised learning

  • Self-supervised learning (SSL): supervise using labels generated from the data without any manual or weak label sources
    • Sub-class of unsupervised learning
  • Idea: Hide or modify part of the input. Ask model to recover input or classify what changed
    • The self-supervised task, referred to as the pretext task, can be formulated using only unlabeled data
    • The features obtained from pretext tasks are transferred to downstream tasks like classification, object detection, and segmentation



Pretext Tasks

  • Solving the pretext tasks allows the model to learn good features.

  • We can automatically generate labels for the pretext tasks.



2. Pretext Tasks

2.1. Pretext Task - Context Prediction

  • After extracting 9 patches from one input image, a classifier is trained to predict the relative location of a patch with respect to the middle patch
  • A pair consisting of the middle patch and one other patch is given as input to the network
  • Method to avoid trivial solutions
    • uneven spacing between patches
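A minimal sketch of generating one (patch pair, relative-position label) training example; the helper name `context_pairs`, patch size, and gap are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

# Illustrative sketch (not the paper's exact recipe): cut a 3x3 grid of
# patches with a small gap between them, then form a training example
# from the center patch, one randomly chosen neighbor, and the
# neighbor's position label (0-7).
def context_pairs(img, patch = 7, gap = 2, rng = None):
    rng = np.random.default_rng() if rng is None else rng
    step = patch + gap            # gap between patches discourages trivial cues
    patches = [img[r*step:r*step + patch, c*step:c*step + patch]
               for r in range(3) for c in range(3)]
    label = int(rng.integers(8))  # 8 possible neighbor positions
    neighbor = patches[label if label < 4 else label + 1]  # skip center slot
    return patches[4], neighbor, label

img = np.arange(28*28, dtype = float).reshape(28, 28)
center, neighbor, label = context_pairs(img, rng = np.random.default_rng(0))
```

The two patches would then be fed through a shared (Siamese) encoder and an 8-way classifier head.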



Carl Doersch, Abhinav Gupta, Alexei A. Efros, 2015, "Unsupervised Visual Representation Learning by Context Prediction," Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1422-1430.

2.2. Pretext Task - Jigsaw Puzzle

  • Generate 9 patches from the input image
  • After shuffling the patches, train a classifier that predicts which permutation was applied, i.e., how to return the patches to their original positions
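A minimal sketch of the jigsaw label generation; the `jigsaw_example` helper is hypothetical, and a tiny hard-coded permutation set stands in for the paper's large set of maximally distinct permutations:

```python
import numpy as np

# Illustrative sketch: shuffle the 9 patches of an image according to a
# permutation drawn from a small predefined set, and use the permutation
# index as the classification label.
PERMS = [np.array(p) for p in [(0,1,2,3,4,5,6,7,8),
                               (8,7,6,5,4,3,2,1,0),
                               (2,0,1,5,3,4,8,6,7)]]

def jigsaw_example(img, patch = 9, rng = None):
    rng = np.random.default_rng() if rng is None else rng
    patches = np.stack([img[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
                        for r in range(3) for c in range(3)])
    label = int(rng.integers(len(PERMS)))
    return patches[PERMS[label]], label   # shuffled patches + permutation id

img = np.arange(27*27, dtype = float).reshape(27, 27)
x, label = jigsaw_example(img, rng = np.random.default_rng(0))
```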





Noroozi, M., and Favaro, P., 2016, "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles," Computer Vision – ECCV 2016, 69–84.

2.3. Pretext Task - Image Colorization

  • Given a grayscale photograph as input, image colorization attacks the problem of hallucinating a plausible color version of the photograph
  • Transfer the trained encoder to the downstream task



Zhang, R., Isola, P., and Efros, A. A., 2016, "Colorful Image Colorization," Computer Vision – ECCV 2016, 649–666.

  • Training data generation for self-supervised learning
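The data generation step can be sketched as follows; note this simplified version pairs a luminance image with the original RGB image, whereas the actual paper predicts ab channels in Lab color space as a classification problem:

```python
import numpy as np

# Illustrative sketch: training pairs for colorization come for free
# from any color image -- the input is a grayscale version, the target
# is the original color image.
def colorization_pair(rgb):
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # luminance, shape (H, W)
    return gray[..., None], rgb                    # (input, target)

rgb = np.random.default_rng(0).random((32, 32, 3))
x, y = colorization_pair(rgb)
```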



  • Network architecture



2.4. Pretext Task - Image Super-resolution

  • What if we prepared training pairs of (low-resolution, original) images by downsampling millions of images we have freely available?
  • Training data generation for self-supervised learning
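The pair generation can be sketched as follows; the `superres_pair` helper is hypothetical, and a simple 2x2 block average stands in for whatever downsampling filter is actually used:

```python
import numpy as np

# Illustrative sketch: (low-resolution, original) pairs are generated by
# downsampling any available image, here via 2x2 block averaging.
def superres_pair(hr, factor = 2):
    h, w = hr.shape
    lr = hr.reshape(h // factor, factor, w // factor, factor).mean(axis = (1, 3))
    return lr, hr                                  # (input, target)

hr = np.random.default_rng(0).random((28, 28))
lr, target = superres_pair(hr)
```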



  • Network architecture



2.5. Pretext Task - Image Inpainting

  • What if we prepared training pairs of (corrupted, original) images by randomly removing part of the images?
  • Training data generation for self-supervised learning
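The pair generation can be sketched as follows; the `inpainting_pair` helper and the square-hole corruption are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: (corrupted, original) pairs are generated by
# zeroing out a random square region of each image; the network is then
# trained to restore the missing part.
def inpainting_pair(img, hole = 8, rng = None):
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    r = int(rng.integers(h - hole))
    c = int(rng.integers(w - hole))
    corrupted = img.copy()
    corrupted[r:r + hole, c:c + hole] = 0.0
    return corrupted, img                          # (input, target)

img = np.random.default_rng(0).random((28, 28)) + 0.1   # strictly positive pixels
x, y = inpainting_pair(img, rng = np.random.default_rng(1))
```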



  • Network architecture



3. Self-supervised Learning

Benefits of Self-supervised Learning

  • Like supervised pretraining, can learn general-purpose feature representations for downstream tasks
  • Reduce expense of hand-labeling large datasets
  • Can leverage nearly unlimited unlabeled data available on the web

Pipeline of Self-supervised Learning

  1. In the pretext task, a deep neural network learns visual features from unlabeled input data
  2. The learned parameters of the network are then fixed, and the trained network serves as a pre-trained model for downstream tasks
  3. The pre-trained model is transferred to downstream tasks and fine-tuned
  4. The performance on downstream tasks is used to evaluate how well the pretext task learned features from unlabeled data



Jing, L., & Tian, Y., 2021, "Self-supervised visual feature learning with Deep Neural Networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4037–4058.

Downstream Tasks

  • After transferring the neural network pre-trained by the pretext task, freeze the weights and build additional layers for the downstream tasks

  • Wide variety of downstream tasks

    • Classification
    • Regression
    • Object detection
    • Segmentation



4. Self-supervised Learning with TensorFlow

Pretext Task - Rotation

  • RotNet
  • Hypothesis: a model could recognize the correct rotation of an object only if it has the “visual commonsense” of what the object should look like

    • Self-supervised learning by rotating the entire input images
    • The model learns to predict which rotation is applied (4-way classification)



  • RotNet: Supervised vs Self-supervised
    • The accuracy gap between the RotNet-based model and the fully supervised Network-In-Network (NIN) model is very small, only 1.64 percentage points
    • The RotNet-based model needs no data labels for training, yet achieves accuracy similar to that of the model trained with labels

Import Library

In [1]:
import tensorflow as tf
import numpy as np 
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
import matplotlib.pyplot as plt

Load MNIST Data

In [2]:
(X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()
XX_train = X_train[10000:11000]
YY_train = Y_train[10000:11000]
X_train = X_train[:10000]
Y_train = Y_train[:10000]
XX_test = X_test[300:600]
YY_test = Y_test[300:600]
X_test = X_test[:300]
Y_test = Y_test[:300]
In [3]:
print('shape of x_train:', X_train.shape)
print('shape of y_train:', Y_train.shape)
print('shape of xx_train:', XX_train.shape)
print('shape of yy_train:', YY_train.shape)
print('shape of x_test:', X_test.shape)
print('shape of y_test:', Y_test.shape)
print('shape of xx_test:', XX_test.shape)
print('shape of yy_test:', YY_test.shape)
shape of x_train: (10000, 28, 28)
shape of y_train: (10000,)
shape of xx_train: (1000, 28, 28)
shape of yy_train: (1000,)
shape of x_test: (300, 28, 28)
shape of y_test: (300,)
shape of xx_test: (300, 28, 28)
shape of yy_test: (300,)

4.1. Build RotNet for Pretext Task



Dataset for Pretext Task (Rotation)

  • Need to generate rotated images and their labels to train the model for the pretext task
    • [1, 0, 0, 0]: 0$^\circ$ rotation
    • [0, 1, 0, 0]: 90$^\circ$ rotation
    • [0, 0, 1, 0]: 180$^\circ$ rotation
    • [0, 0, 0, 1]: 270$^\circ$ rotation
In [4]:
n_samples = X_train.shape[0]
X_rotate = np.zeros(shape = (n_samples*4,
                             X_train.shape[1],
                             X_train.shape[2]))
Y_rotate = np.zeros(shape = (n_samples*4, 4))

for i in range(n_samples):    
    img = X_train[i]
    
    # 0 degrees rotation
    X_rotate[4*i] = img
    Y_rotate[4*i] = tf.one_hot([0], depth = 4)
    
    # 90 degrees rotation
    X_rotate[4*i + 1] = np.rot90(img, k = 1)
    Y_rotate[4*i + 1] = tf.one_hot([1], depth = 4)
    
    # 180 degrees rotation
    X_rotate[4*i + 2] = np.rot90(img, k = 2)
    Y_rotate[4*i + 2] = tf.one_hot([2], depth = 4) 
    
    # 270 degrees rotation
    X_rotate[4*i + 3] = np.rot90(img, k = 3)
    Y_rotate[4*i + 3] = tf.one_hot([3], depth = 4)

Plot Dataset for Pretext Task (Rotation)

In [5]:
plt.subplots(figsize = (10, 10))

plt.subplot(141)
plt.imshow(X_rotate[12], cmap = 'gray')
plt.axis('off') 

plt.subplot(142)
plt.imshow(X_rotate[13], cmap = 'gray')
plt.axis('off') 

plt.subplot(143)
plt.imshow(X_rotate[14], cmap = 'gray')
plt.axis('off') 

plt.subplot(144)
plt.imshow(X_rotate[15], cmap = 'gray')
plt.axis('off') 
Out[5]:
(-0.5, 27.5, 27.5, -0.5)
In [6]:
X_rotate = X_rotate.reshape(-1,28,28,1)

Build Model for Pretext Task (Rotation)

In [7]:
model_pretext = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 64, 
                           kernel_size = (3,3), 
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),
    
    tf.keras.layers.MaxPool2D(pool_size = (2, 2), 
                              strides = (2, 2)),
    
    tf.keras.layers.Conv2D(filters = 32, 
                           kernel_size = (3,3), 
                           strides = (1,1), 
                           activation = 'relu',
                           padding = 'SAME'),
    
    tf.keras.layers.MaxPool2D(pool_size = (2, 2), 
                              strides = (2, 2)),
    
    tf.keras.layers.Conv2D(filters = 16, 
                           kernel_size = (3,3),
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME'),
    
    tf.keras.layers.Flatten(),
    
    tf.keras.layers.Dense(units = 4, activation = 'softmax')
    
])
model_pretext.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 14, 14, 64)        640       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 7, 7, 32)          18464     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 2, 2, 16)          4624      
_________________________________________________________________
flatten (Flatten)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 4)                 260       
=================================================================
Total params: 23,988
Trainable params: 23,988
Non-trainable params: 0
_________________________________________________________________
  • Training the model for the pretext task



In [8]:
model_pretext.compile(optimizer = 'adam',
                      loss = 'categorical_crossentropy',
                      metrics = 'accuracy')

model_pretext.fit(X_rotate, 
                  Y_rotate, 
                  batch_size = 192, 
                  epochs = 50,
                  verbose = 2, 
                  shuffle = False)
Epoch 1/50
209/209 - 5s - loss: 0.9973 - accuracy: 0.7692
Epoch 2/50
209/209 - 5s - loss: 0.1854 - accuracy: 0.9366
Epoch 3/50
209/209 - 4s - loss: 0.1246 - accuracy: 0.9571
Epoch 4/50
209/209 - 5s - loss: 0.0957 - accuracy: 0.9678
Epoch 5/50
209/209 - 5s - loss: 0.0736 - accuracy: 0.9744
Epoch 6/50
209/209 - 5s - loss: 0.0647 - accuracy: 0.9782
Epoch 7/50
209/209 - 4s - loss: 0.0633 - accuracy: 0.9785
Epoch 8/50
209/209 - 5s - loss: 0.0554 - accuracy: 0.9804
Epoch 9/50
209/209 - 5s - loss: 0.0469 - accuracy: 0.9834
Epoch 10/50
209/209 - 5s - loss: 0.0506 - accuracy: 0.9826
Epoch 11/50
209/209 - 5s - loss: 0.0406 - accuracy: 0.9855
Epoch 12/50
209/209 - 5s - loss: 0.0390 - accuracy: 0.9862
Epoch 13/50
209/209 - 5s - loss: 0.0404 - accuracy: 0.9858
Epoch 14/50
209/209 - 4s - loss: 0.0517 - accuracy: 0.9811
Epoch 15/50
209/209 - 5s - loss: 0.0473 - accuracy: 0.9823
Epoch 16/50
209/209 - 5s - loss: 0.0504 - accuracy: 0.9818
Epoch 17/50
209/209 - 5s - loss: 0.0357 - accuracy: 0.9871
Epoch 18/50
209/209 - 5s - loss: 0.0267 - accuracy: 0.9908
Epoch 19/50
209/209 - 6s - loss: 0.0272 - accuracy: 0.9900
Epoch 20/50
209/209 - 5s - loss: 0.0273 - accuracy: 0.9901
Epoch 21/50
209/209 - 5s - loss: 0.0190 - accuracy: 0.9926
Epoch 22/50
209/209 - 5s - loss: 0.0263 - accuracy: 0.9903
Epoch 23/50
209/209 - 5s - loss: 0.0227 - accuracy: 0.9913
Epoch 24/50
209/209 - 4s - loss: 0.0249 - accuracy: 0.9908
Epoch 25/50
209/209 - 5s - loss: 0.0344 - accuracy: 0.9883
Epoch 26/50
209/209 - 5s - loss: 0.0222 - accuracy: 0.9916
Epoch 27/50
209/209 - 6s - loss: 0.0237 - accuracy: 0.9919
Epoch 28/50
209/209 - 5s - loss: 0.0214 - accuracy: 0.9926
Epoch 29/50
209/209 - 4s - loss: 0.0244 - accuracy: 0.9913
Epoch 30/50
209/209 - 4s - loss: 0.0238 - accuracy: 0.9916
Epoch 31/50
209/209 - 4s - loss: 0.0317 - accuracy: 0.9894
Epoch 32/50
209/209 - 4s - loss: 0.0209 - accuracy: 0.9926
Epoch 33/50
209/209 - 4s - loss: 0.0226 - accuracy: 0.9926
Epoch 34/50
209/209 - 4s - loss: 0.0171 - accuracy: 0.9943
Epoch 35/50
209/209 - 5s - loss: 0.0098 - accuracy: 0.9965
Epoch 36/50
209/209 - 5s - loss: 0.0200 - accuracy: 0.9932
Epoch 37/50
209/209 - 4s - loss: 0.0213 - accuracy: 0.9933
Epoch 38/50
209/209 - 4s - loss: 0.0174 - accuracy: 0.9940
Epoch 39/50
209/209 - 5s - loss: 0.0094 - accuracy: 0.9963
Epoch 40/50
209/209 - 6s - loss: 0.0214 - accuracy: 0.9931
Epoch 41/50
209/209 - 6s - loss: 0.0445 - accuracy: 0.9866
Epoch 42/50
209/209 - 5s - loss: 0.0168 - accuracy: 0.9941
Epoch 43/50
209/209 - 5s - loss: 0.0180 - accuracy: 0.9943
Epoch 44/50
209/209 - 5s - loss: 0.0131 - accuracy: 0.9951
Epoch 45/50
209/209 - 5s - loss: 0.0172 - accuracy: 0.9942
Epoch 46/50
209/209 - 5s - loss: 0.0120 - accuracy: 0.9956
Epoch 47/50
209/209 - 5s - loss: 0.0153 - accuracy: 0.9949
Epoch 48/50
209/209 - 6s - loss: 0.0217 - accuracy: 0.9931
Epoch 49/50
209/209 - 6s - loss: 0.0212 - accuracy: 0.9930
Epoch 50/50
209/209 - 6s - loss: 0.0116 - accuracy: 0.9961
Out[8]:
<tensorflow.python.keras.callbacks.History at 0x227c7e9f908>

4.2. Build Downstream Task (MNIST Image Classification)

  • Freeze the trained parameters to transfer them to the downstream task
In [9]:
model_pretext.trainable = False

Reshape Dataset

In [10]:
XX_train = XX_train.reshape(-1,28,28,1)
XX_test = XX_test.reshape(-1,28,28,1)
YY_train = tf.one_hot(YY_train, 10, on_value = 1.0, off_value = 0.0)
YY_test = tf.one_hot(YY_test, 10, on_value = 1.0, off_value = 0.0)

Build Model

  • Model: two convolution layers and one fully connected layer
    • The two convolution layers are transferred from the model for the pretext task
    • Only the single fully connected layer is trained



In [11]:
model_downstream = tf.keras.models.Sequential([ 
    model_pretext.get_layer(index = 0), 
    model_pretext.get_layer(index = 1),   
    model_pretext.get_layer(index = 2),  
    model_pretext.get_layer(index = 3),
    
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')    
])

model_downstream.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 14, 14, 64)        640       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 7, 7, 32)          18464     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 288)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                2890      
=================================================================
Total params: 21,994
Trainable params: 2,890
Non-trainable params: 19,104
_________________________________________________________________
In [12]:
model_downstream.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001,momentum = 0.9),
                         loss = 'categorical_crossentropy',
                         metrics = 'accuracy')

model_downstream.fit(XX_train, 
                     YY_train, 
                     batch_size = 64,
                     validation_split = 0.2, 
                     epochs = 50, 
                     verbose = 2,
                     callbacks = tf.keras.callbacks.EarlyStopping(monitor = 'accuracy', patience = 7))
Epoch 1/50
13/13 - 0s - loss: 190.8943 - accuracy: 0.1300 - val_loss: 286.7282 - val_accuracy: 0.1500
Epoch 2/50
13/13 - 0s - loss: 155.8605 - accuracy: 0.3150 - val_loss: 85.4378 - val_accuracy: 0.4600
Epoch 3/50
13/13 - 0s - loss: 44.9868 - accuracy: 0.6175 - val_loss: 31.6631 - val_accuracy: 0.7150
Epoch 4/50
13/13 - 0s - loss: 21.7824 - accuracy: 0.7900 - val_loss: 24.9688 - val_accuracy: 0.7400
Epoch 5/50
13/13 - 0s - loss: 13.8247 - accuracy: 0.8275 - val_loss: 13.7446 - val_accuracy: 0.8600
Epoch 6/50
13/13 - 0s - loss: 8.6790 - accuracy: 0.8675 - val_loss: 10.6283 - val_accuracy: 0.8750
Epoch 7/50
13/13 - 0s - loss: 6.2890 - accuracy: 0.8863 - val_loss: 12.3585 - val_accuracy: 0.8600
Epoch 8/50
13/13 - 0s - loss: 4.4847 - accuracy: 0.9150 - val_loss: 9.8872 - val_accuracy: 0.8550
Epoch 9/50
13/13 - 0s - loss: 3.9848 - accuracy: 0.9175 - val_loss: 9.3085 - val_accuracy: 0.8900
Epoch 10/50
13/13 - 0s - loss: 2.7134 - accuracy: 0.9362 - val_loss: 7.5365 - val_accuracy: 0.9150
Epoch 11/50
13/13 - 0s - loss: 2.3395 - accuracy: 0.9312 - val_loss: 9.4580 - val_accuracy: 0.9000
Epoch 12/50
13/13 - 0s - loss: 2.6337 - accuracy: 0.9312 - val_loss: 8.8336 - val_accuracy: 0.8950
Epoch 13/50
13/13 - 0s - loss: 3.3520 - accuracy: 0.9162 - val_loss: 9.7958 - val_accuracy: 0.8650
Epoch 14/50
13/13 - 0s - loss: 1.8601 - accuracy: 0.9488 - val_loss: 8.0040 - val_accuracy: 0.9250
Epoch 15/50
13/13 - 0s - loss: 1.4604 - accuracy: 0.9575 - val_loss: 8.1195 - val_accuracy: 0.8900
Epoch 16/50
13/13 - 0s - loss: 1.6075 - accuracy: 0.9538 - val_loss: 10.1115 - val_accuracy: 0.8850
Epoch 17/50
13/13 - 0s - loss: 1.5618 - accuracy: 0.9550 - val_loss: 8.2215 - val_accuracy: 0.9150
Epoch 18/50
13/13 - 0s - loss: 1.2585 - accuracy: 0.9638 - val_loss: 7.9202 - val_accuracy: 0.9100
Epoch 19/50
13/13 - 0s - loss: 1.3084 - accuracy: 0.9463 - val_loss: 8.5628 - val_accuracy: 0.8950
Epoch 20/50
13/13 - 0s - loss: 1.2891 - accuracy: 0.9475 - val_loss: 8.3959 - val_accuracy: 0.8950
Epoch 21/50
13/13 - 0s - loss: 0.8199 - accuracy: 0.9725 - val_loss: 7.5670 - val_accuracy: 0.9000
Epoch 22/50
13/13 - 0s - loss: 0.4015 - accuracy: 0.9787 - val_loss: 8.6314 - val_accuracy: 0.8900
Epoch 23/50
13/13 - 0s - loss: 0.7128 - accuracy: 0.9663 - val_loss: 7.4462 - val_accuracy: 0.9200
Epoch 24/50
13/13 - 0s - loss: 0.8462 - accuracy: 0.9600 - val_loss: 7.7180 - val_accuracy: 0.9000
Epoch 25/50
13/13 - 0s - loss: 0.4770 - accuracy: 0.9762 - val_loss: 7.3963 - val_accuracy: 0.9150
Epoch 26/50
13/13 - 0s - loss: 0.4459 - accuracy: 0.9787 - val_loss: 8.0195 - val_accuracy: 0.8900
Epoch 27/50
13/13 - 0s - loss: 0.3073 - accuracy: 0.9775 - val_loss: 7.3111 - val_accuracy: 0.9200
Epoch 28/50
13/13 - 0s - loss: 0.2940 - accuracy: 0.9775 - val_loss: 9.1896 - val_accuracy: 0.9050
Epoch 29/50
13/13 - 0s - loss: 0.5168 - accuracy: 0.9737 - val_loss: 7.9862 - val_accuracy: 0.9250
Out[12]:
<tensorflow.python.keras.callbacks.History at 0x227d181dd30>

Downstream Task Trained Result (Image Classification Result)

In [13]:
name = ['0', '1', '2', '3', '4', '5','6', '7', '8', '9']
idx = 9
img = XX_train[idx].reshape(-1,28,28,1)
label = YY_train[idx]
predict = model_downstream.predict(img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (12,5))
plt.subplot(1,2,1)
plt.imshow(img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(name[mypred[0]]))
Prediction : 2

4.3. Build Supervised Model for Comparison

  • Convolution Neural Networks for MNIST image classification
    • Model: same architecture as the model for the downstream task
    • The total number of parameters is the same as in the downstream model, but it has zero non-trainable parameters
In [14]:
model_sup = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 64, 
                           kernel_size = (3,3), 
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),
    
    tf.keras.layers.MaxPool2D(pool_size = (2, 2), 
                              strides = (2, 2)),
    
    tf.keras.layers.Conv2D(filters = 32, 
                           kernel_size = (3,3), 
                           strides = (1,1), 
                           activation = 'relu',
                           padding = 'SAME'),
    
    tf.keras.layers.MaxPool2D(pool_size = (2, 2), 
                              strides = (2, 2)),
    
    tf.keras.layers.Flatten(),
    
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
    
])
model_sup.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_3 (Conv2D)            (None, 14, 14, 64)        640       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 7, 7, 32)          18464     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 3, 3, 32)          0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 288)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2890      
=================================================================
Total params: 21,994
Trainable params: 21,994
Non-trainable params: 0
_________________________________________________________________
In [15]:
model_sup.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001,momentum = 0.9),
                  loss = 'categorical_crossentropy',
                  metrics = 'accuracy')
model_sup.fit(XX_train,
              YY_train,
              batch_size = 32,
              validation_split = 0.2,
              epochs = 50,
              verbose = 2,
              callbacks = tf.keras.callbacks.EarlyStopping(monitor = 'accuracy', patience = 7))
Epoch 1/50
25/25 - 0s - loss: 11.4118 - accuracy: 0.1437 - val_loss: 2.2270 - val_accuracy: 0.1350
Epoch 2/50
25/25 - 0s - loss: 2.0790 - accuracy: 0.2087 - val_loss: 1.8073 - val_accuracy: 0.3800
Epoch 3/50
25/25 - 0s - loss: 1.6150 - accuracy: 0.4625 - val_loss: 1.3722 - val_accuracy: 0.5200
Epoch 4/50
25/25 - 0s - loss: 1.1714 - accuracy: 0.6237 - val_loss: 1.1617 - val_accuracy: 0.6000
Epoch 5/50
25/25 - 0s - loss: 0.9273 - accuracy: 0.6938 - val_loss: 1.0794 - val_accuracy: 0.6800
Epoch 6/50
25/25 - 0s - loss: 0.7822 - accuracy: 0.7287 - val_loss: 0.8643 - val_accuracy: 0.7300
Epoch 7/50
25/25 - 0s - loss: 0.6739 - accuracy: 0.7550 - val_loss: 0.9741 - val_accuracy: 0.6650
Epoch 8/50
25/25 - 0s - loss: 0.6540 - accuracy: 0.7862 - val_loss: 0.8546 - val_accuracy: 0.7450
Epoch 9/50
25/25 - 0s - loss: 0.5504 - accuracy: 0.8037 - val_loss: 0.8088 - val_accuracy: 0.6950
Epoch 10/50
25/25 - 0s - loss: 0.4761 - accuracy: 0.8288 - val_loss: 0.7965 - val_accuracy: 0.7500
Epoch 11/50
25/25 - 0s - loss: 0.4663 - accuracy: 0.8487 - val_loss: 0.8886 - val_accuracy: 0.7500
Epoch 12/50
25/25 - 0s - loss: 0.4227 - accuracy: 0.8550 - val_loss: 0.8157 - val_accuracy: 0.7500
Epoch 13/50
25/25 - 0s - loss: 0.3805 - accuracy: 0.8600 - val_loss: 0.9497 - val_accuracy: 0.7150
Epoch 14/50
25/25 - 0s - loss: 0.3937 - accuracy: 0.8612 - val_loss: 0.8810 - val_accuracy: 0.7450
Epoch 15/50
25/25 - 0s - loss: 0.3316 - accuracy: 0.8788 - val_loss: 1.0198 - val_accuracy: 0.7450
Epoch 16/50
25/25 - 0s - loss: 0.3067 - accuracy: 0.8950 - val_loss: 0.8269 - val_accuracy: 0.7300
Epoch 17/50
25/25 - 0s - loss: 0.3170 - accuracy: 0.8813 - val_loss: 0.9856 - val_accuracy: 0.7350
Epoch 18/50
25/25 - 0s - loss: 0.3231 - accuracy: 0.8925 - val_loss: 0.9370 - val_accuracy: 0.7450
Epoch 19/50
25/25 - 0s - loss: 0.2559 - accuracy: 0.9038 - val_loss: 0.8675 - val_accuracy: 0.7600
Epoch 20/50
25/25 - 0s - loss: 0.1949 - accuracy: 0.9337 - val_loss: 0.9243 - val_accuracy: 0.7800
Epoch 21/50
25/25 - 0s - loss: 0.1898 - accuracy: 0.9287 - val_loss: 1.1273 - val_accuracy: 0.7400
Epoch 22/50
25/25 - 0s - loss: 0.1746 - accuracy: 0.9362 - val_loss: 1.0892 - val_accuracy: 0.7450
Epoch 23/50
25/25 - 0s - loss: 0.1834 - accuracy: 0.9275 - val_loss: 1.0857 - val_accuracy: 0.7550
Epoch 24/50
25/25 - 0s - loss: 0.1262 - accuracy: 0.9663 - val_loss: 1.2073 - val_accuracy: 0.7450
Epoch 25/50
25/25 - 0s - loss: 0.1446 - accuracy: 0.9413 - val_loss: 1.1864 - val_accuracy: 0.7550
Epoch 26/50
25/25 - 0s - loss: 0.1261 - accuracy: 0.9650 - val_loss: 1.0797 - val_accuracy: 0.7650
Epoch 27/50
25/25 - 0s - loss: 0.1031 - accuracy: 0.9613 - val_loss: 1.1742 - val_accuracy: 0.7800
Epoch 28/50
25/25 - 0s - loss: 0.1242 - accuracy: 0.9513 - val_loss: 1.1519 - val_accuracy: 0.7600
Epoch 29/50
25/25 - 0s - loss: 0.1096 - accuracy: 0.9588 - val_loss: 1.2944 - val_accuracy: 0.7800
Epoch 30/50
25/25 - 0s - loss: 0.0943 - accuracy: 0.9688 - val_loss: 1.1582 - val_accuracy: 0.7700
Epoch 31/50
25/25 - 0s - loss: 0.0782 - accuracy: 0.9712 - val_loss: 1.2333 - val_accuracy: 0.7700
Epoch 32/50
25/25 - 0s - loss: 0.0651 - accuracy: 0.9737 - val_loss: 1.3444 - val_accuracy: 0.7600
Epoch 33/50
25/25 - 0s - loss: 0.0492 - accuracy: 0.9862 - val_loss: 1.3857 - val_accuracy: 0.7700
Epoch 34/50
25/25 - 0s - loss: 0.0587 - accuracy: 0.9775 - val_loss: 1.4542 - val_accuracy: 0.7450
Epoch 35/50
25/25 - 0s - loss: 0.0668 - accuracy: 0.9762 - val_loss: 1.3023 - val_accuracy: 0.7700
Epoch 36/50
25/25 - 0s - loss: 0.0533 - accuracy: 0.9812 - val_loss: 1.5603 - val_accuracy: 0.7550
Epoch 37/50
25/25 - 0s - loss: 0.0993 - accuracy: 0.9675 - val_loss: 1.4142 - val_accuracy: 0.7500
Epoch 38/50
25/25 - 0s - loss: 0.0654 - accuracy: 0.9737 - val_loss: 1.3829 - val_accuracy: 0.7900
Epoch 39/50
25/25 - 0s - loss: 0.0611 - accuracy: 0.9800 - val_loss: 1.4364 - val_accuracy: 0.7700
Epoch 40/50
25/25 - 0s - loss: 0.0714 - accuracy: 0.9725 - val_loss: 1.4401 - val_accuracy: 0.7800
Out[15]:
<tensorflow.python.keras.callbacks.History at 0x227c1ad7470>

Compare Self-supervised Learning and Supervised Learning

  • Pretext Task
    • Input data: 10,000 MNIST images without labels
  • Downstream Task and Supervised Learning (for performance comparison)
    • Training data: 1,000 MNIST images with labels
    • Test data: 300 MNIST images with labels
  • Key concepts
    • For transfer learning, networks such as VGG16 have conventionally been pretrained on large labeled image datasets such as ImageNet
    • With self-supervised learning, we can instead pretrain such networks on unlabeled image datasets, which contain far more samples than labeled datasets, and then perform transfer learning
    • Comparing the downstream-task performance with that of supervised learning thus amounts to comparing (self-supervised) transfer learning against purely supervised learning
In [16]:
test_self = model_downstream.evaluate(XX_test,YY_test,batch_size = 64,verbose = 2)

print("")
print('Self-supervised Learning Accuracy on Test Data:  {:.2f}%'.format(test_self[1]*100))
5/5 - 0s - loss: 6.6431 - accuracy: 0.8633

Self-supervised Learning Accuracy on Test Data:  86.33%
In [17]:
test_sup = model_sup.evaluate(XX_test,YY_test,batch_size = 64, verbose = 2)

print("")
print('Supervised Learning Accuracy on Test Data:  {:.2f}%'.format(test_sup[1]*100))
5/5 - 0s - loss: 1.5511 - accuracy: 0.7500

Supervised Learning Accuracy on Test Data:  75.00%